-
Notifications
You must be signed in to change notification settings - Fork 267
perf: Implement specialized aggregates for COUNT(*) and COUNT(expr)
#2397
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #2397 +/- ##
============================================
+ Coverage 56.12% 57.51% +1.39%
- Complexity 976 1295 +319
============================================
Files 119 147 +28
Lines 11743 13469 +1726
Branches 2251 2352 +101
============================================
+ Hits 6591 7747 +1156
- Misses 4012 4457 +445
- Partials 1140 1265 +125 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
| - name: "fuzz" | ||
| value: | | ||
| org.apache.comet.CometFuzzTestSuite | ||
| org.apache.comet.CometFuzzAggregateSuite |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I forgot to add this initially, and CI failed thanks to the recently added checks for this 😄
Oh I love this.
Did we try if performance improved at all recently? How long ago was this? |
| } | ||
|
|
||
| // Count all rows regardless of null values | ||
| let array = &values[0]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps https://doc.rust-lang.org/std/vec/struct.Vec.html#method.get_unchecked ? we can avoid boundary check if we checked above the values is not empty
This was ~1 year ago. See #744 for more information. Count was ~10x slower than Sum at the time (but only when integrated with Comet). |
I've moved this PR to draft for now. I will split out the testing changes into a separate PR and then experiment again with using DataFusion's count. |
|
Closing this since #2407 is less maintenance |
Which issue does this PR close?
N/A
Rationale for this change
TPC-H q1 time improves from 12.55s to 11.57s (8% speedup).
What changes are included in this PR?
This PR fixes some old tech debt around the way we implemented
COUNTaggregates. Originally, we tried using DataFusion's count aggregate but it was extremely slow when integrated into Comet. We did not find the root cause for this, but instead, we implementedCOUNT(expr)asSUM(IF(expr IS NOT NULL, 1, 0))and this provided better performance. However, this is not efficient since there are intermediate arrays being created from theIFexpression. ForCOUNT(*), which Spark translates toCOUNT(1), this was even more inefficient because1is never null.CountRowsandCountNotNullaggregates that are faster and more memory efficient.SUMapproach is still used for counts with multiple arguments, such asCOUNT(expr1, expr2), which translates toSUM(IF(expr1 IS NOT NULL AND expr2 IS NOT NULL, 1, 0))CometFuzzTestBaseextracted fromCometFuzzTestSuiteCometFuzzAggregateSuiteHow are these changes tested?
New fuzz tests are added. In
CometAggregateSuite, we did not have any tests forcount(*)or for count with multiple expressions.